[Hardware][Xilinx] explicitly specify acc dep distance to avoid hidden pitfall #8

zhanghaohit · 2020-04-29T09:24:12Z

Bug

Unrestrained #pragma HLS DEPENDENCE in VTA core waiving all loop-carried dependencies, causing wrong results in 32x32 configuration

Solution

make acc dependence configurable, and also do corresponding checking in VTA runtime (see here).

zhanghaohit · 2020-04-29T09:47:25Z

@tqchen @tmoreau89 Could you kindly help review this PR? Thanks.

tmoreau89 · 2020-04-30T01:06:10Z

Thanks @zhanghaohit, for catching this error. I agree with the fact that the dependence should not be hardcoded. However in order to not to add too many parameters in the config file, can we derive the value from the VTA target (i.e. FPGA type) in pkg_config?

Also from what I understand is that the dependence distance is generally a property of the memory part that the FPGA is using. For instance that distance will be different if one uses BRAM vs. Ultra-RAM? And on a different FPGA family this number will change. Correct?

tmoreau89

See comments above

zhanghaohit · 2020-04-30T02:01:50Z

Thanks @zhanghaohit, for catching this error. I agree with the fact that the dependence should not be hardcoded. However in order to not to add too many parameters in the config file, can we derive the value from the VTA target (i.e. FPGA type) in pkg_config?

Also from what I understand is that the dependence distance is generally a property of the memory part that the FPGA is using. For instance that distance will be different if one uses BRAM vs. Ultra-RAM? And on a different FPGA family this number will change. Correct?

I think the dependence distance here is defined by the user, to tell the compiler (e.g., XILINX vivado) that, to what extent, the software layer (i.e., TVM code generator) can guarantee there is no acc memory conflict. Say, if the distance = 4, meaning that the software will guarantee there is no conflict within 4 iterations. So the compiler can generate hardware code with smaller II as it is not necessary for the compiler to generate code with larger II to eliminate memory conflict from the hardware perspective.

Thus, I also add the check in VTA runtime to do the VerifyDep.

remotego · 2020-04-30T03:36:52Z

Thanks @zhanghaohit, for catching this error. I agree with the fact that the dependence should not be hardcoded. However in order to not to add too many parameters in the config file, can we derive the value from the VTA target (i.e. FPGA type) in pkg_config?

Also from what I understand is that the dependence distance is generally a property of the memory part that the FPGA is using. For instance that distance will be different if one uses BRAM vs. Ultra-RAM? And on a different FPGA family this number will change. Correct?

Thanks @tmoreau89 for the advice. We also thought of putting the value into the pkg_config. However, later we found out the dependence distance is not tightly related to the device family/type.

In my opinion, the compiler will decide the II based on multiple factors. But eventually, it is based on the number of cycles on the datapath from "Read Mem" -> "Perform Ops" -> "Write Back", as we need to avoid RAW data hazard.

In the "Read Mem" and "Write Back" stage, it is definitely related to the properties of the memory (eg. device family, uram vs bram, etc). But it is also related to the H/W circuit of accessing the data. For example, multiplexers as there are multiple accesses. The compile will judge based on the complexity of the overall circuit, and it could add registers (increase II) if the desired frequency target could not be satisfied.

In "Perform Ops" stage, the cycles needed is mainly related to the op itself. For example, a 32-bit multiplication may require more cycles than a 8-bit multiplication.

tmoreau89 · 2020-04-30T03:55:25Z

Thank you for the explanation. Do you mind breaking down what happens in the 32x32 configuration that causes correctness to fail (since the dependence analysis waive is unbounded)?

I agree that there should be a constant set for hardware and software to agree on, but I'm not convinced that it should be set by a user, and instead be derived automatically by the hardware target we choose in pkg_config. This is to ensure that we can achieve an iteration interval of II=1 for the GEMM pipeline, which is constrained by the SRAM write to read latency, and then enforced by the software.

The example I can use is that if we are say we want to accumulate SRAM at address A and B for 3 consecutive cycles. We could do it (1) by updating each address consecutively, or (2) by interlacing the updates.

// initial contents of SRAM at address A is 20, and B is 14
// here we assume 2 cycles to propagate a write to being able to read it
t=0: SRAM[A] += 1 // SRAM[A] is 20, SRAM[B] is 14
t=1: SRAM[A] += 1 // SRAM[A] is 20, SRAM[B] is 14
t=2: SRAM[A] += 1 // SRAM[A] is 21, SRAM[B] is 14
t=3: SRAM[B] += 1 // SRAM[A] is 21, SRAM[B] is 14
t=4: SRAM[B] += 1 // SRAM[A] is 22, SRAM[B] is 14
t=5: SRAM[B] += 1 // SRAM[A] is 22, SRAM[B] is 15
t=6: noop // SRAM[A] is 22, SRAM[B] is 15
t=7: noop // SRAM[A] is 22, SRAM[B] is 16

// Results are incorrect: A final value is 22 instead of 23, and B final value is 16 instead of 17

Now in the case where these back to back writes are detected in the runtime, we would have an invalid schedule. So assuming the TVM compiler produces a valid schedule, we would end up with interlaced writes to A and B:

// initial contents of SRAM at address A is 20, and B is 14
// here we assume 2 cycles to propagate a write to being able to read it
t=0: SRAM[A] += 1 // SRAM[A] is 20, SRAM[B] is 14
t=1: SRAM[B] += 1 // SRAM[A] is 20, SRAM[B] is 14
t=2: SRAM[A] += 1 // SRAM[A] is 21, SRAM[B] is 14
t=3: SRAM[B] += 1 // SRAM[A] is 21, SRAM[B] is 15
t=4: SRAM[A] += 1 // SRAM[A] is 22, SRAM[B] is 15
t=5: SRAM[B] += 1 // SRAM[A] is 22, SRAM[B] is 16
t=6: noop // SRAM[A] is 23, SRAM[B] is 16
t=7: noop // SRAM[A] is 23, SRAM[B] is 17

// Results are now correct because the compiler correctly ensured enough wait time was allowed between consecutive writes and reads to the same address in SRAM.

So therefore the point I'm hoping to make is that this 2 cycle wait time in enforced by the SRAM, and should not be exposed to the user to be set as a parameter in VTA_config.

tmoreau89 · 2020-04-30T03:58:27Z

@remotego I just read your comment that must have been posted as I wrote my reply. Your explanation makes sense, ultimately the critical path in the accumulation is READ -> ADD to GEMM result -> WRITE, which can be more than 2 cycles if we account for the addition logic, and various multiplexers etc.

tmoreau89 · 2020-04-30T04:00:07Z

I assume that if you have an FPGA that needs to be clocked at a high frequency, you'll likely end up with more cycles.

tmoreau89 · 2020-04-30T04:03:43Z

So to conclude, I agree with you that we may want this parameter to be larger than 2 cycles if the hardware pipeline becomes more complicated.

So the big question here is: how can we ensure that it is set correctly moving forward. For instance a DISTANCE=2 works for 16x16, but might break for 32x32. Is there a way that the user as she performs design space exploration does not have to think about updating this parameter? Or at least receives some guidance to update it to the right value?

tmoreau89 · 2020-04-30T04:04:48Z

Perhaps one way to look at this is to start with DISTANCE=3 by default. And if the II>1 for the GEMM, we issue a warning telling the user to increase the distance to 4. The process can repeat until the II of 1 is achieved, and the runtime will be informed of the actual distance so proper checks can be inserted.

tmoreau89 · 2020-04-30T04:05:36Z

I'll think some more about this problem overnight, thanks for bringing attention to this issue @zhanghaohit @remotego

remotego · 2020-04-30T04:12:06Z

Perhaps one way to look at this is to start with DISTANCE=3 by default. And if the II>1 for the GEMM, we issue a warning telling the user to increase the distance to 4. The process can repeat until the II of 1 is achieved, and the runtime will be informed of the actual distance so proper checks can be inserted.

I love this idea.

Our ultimate goal is to achieve ii = 1, thus it would be great if we could achieve it by some intelligent scripting... only problem is that we need to introduce a way to feedback the results to S/W side.

zhanghaohit · 2020-05-17T08:11:12Z

Perhaps one way to look at this is to start with DISTANCE=3 by default. And if the II>1 for the GEMM, we issue a warning telling the user to increase the distance to 4. The process can repeat until the II of 1 is achieved, and the runtime will be informed of the actual distance so proper checks can be inserted.

Thanks @tmoreau89 for the advice.

For this issue, do you suggest to do it manually? For example, we have a config (e.g., ACC_DEP_DISTANCE) set with a default value (say, 3). If II > 1, we just issue a warning message and ask the user to enlarge this config if he want to achieve II = 1?

liangfu · 2020-05-31T04:45:04Z

I would agree with

derive the value from the VTA target (i.e. FPGA type) in pkg_config

, and avoid requiring a user to specify the ACC_DEP_DISTANCE parameter.

tmoreau89 · 2020-06-11T03:41:57Z

@zhanghaohit I think that the warning approach is a reasonable compromise between user-defined flexibility and making sure that they set the value correctly.

To echo @liangfu , it would be ideal to derive the value in pkg_config based on simple rules of FPGA target, and matrix multiplication size. We can tweak those rules if say we discover that the dependence distance needs to be increased.

If you agree I suggest we go ahead with these changes.

This PR add support for cross compiling dylib to x86 macos Co-authored-by: tqchen <[email protected]>

bugfix: explicitly specify acc dep distance to avoid hidden pitfall

6fb9897

zhanghaohit mentioned this pull request Apr 29, 2020

[VTA][Runtime] fix hardcoded VerifyDep step apache/tvm#5471

Closed

tmoreau89 requested changes Apr 30, 2020

View reviewed changes

tqchen changed the base branch from master to main October 11, 2020 17:43

jinhongyii pushed a commit that referenced this pull request Sep 5, 2023

Support for cross compiling dylib to x86 macos (#8)

ae08d4c

This PR add support for cross compiling dylib to x86 macos Co-authored-by: tqchen <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Hardware][Xilinx] explicitly specify acc dep distance to avoid hidden pitfall #8

[Hardware][Xilinx] explicitly specify acc dep distance to avoid hidden pitfall #8

zhanghaohit commented Apr 29, 2020 •

edited

Loading

zhanghaohit commented Apr 29, 2020

tmoreau89 commented Apr 30, 2020

tmoreau89 left a comment

zhanghaohit commented Apr 30, 2020

remotego commented Apr 30, 2020

tmoreau89 commented Apr 30, 2020

tmoreau89 commented Apr 30, 2020

tmoreau89 commented Apr 30, 2020

tmoreau89 commented Apr 30, 2020

tmoreau89 commented Apr 30, 2020

tmoreau89 commented Apr 30, 2020

remotego commented Apr 30, 2020

zhanghaohit commented May 17, 2020 •

edited

Loading

liangfu commented May 31, 2020

tmoreau89 commented Jun 11, 2020

[Hardware][Xilinx] explicitly specify acc dep distance to avoid hidden pitfall #8

Are you sure you want to change the base?

[Hardware][Xilinx] explicitly specify acc dep distance to avoid hidden pitfall #8

Conversation

zhanghaohit commented Apr 29, 2020 • edited Loading

Bug

Solution

zhanghaohit commented Apr 29, 2020

tmoreau89 commented Apr 30, 2020

tmoreau89 left a comment

Choose a reason for hiding this comment

zhanghaohit commented Apr 30, 2020

remotego commented Apr 30, 2020

tmoreau89 commented Apr 30, 2020

tmoreau89 commented Apr 30, 2020

tmoreau89 commented Apr 30, 2020

tmoreau89 commented Apr 30, 2020

tmoreau89 commented Apr 30, 2020

tmoreau89 commented Apr 30, 2020

remotego commented Apr 30, 2020

zhanghaohit commented May 17, 2020 • edited Loading

liangfu commented May 31, 2020

tmoreau89 commented Jun 11, 2020

zhanghaohit commented Apr 29, 2020 •

edited

Loading

zhanghaohit commented May 17, 2020 •

edited

Loading